import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
os.chdir("C:\\Users\\ASUS\\Desktop")
data = pd.read_csv("clinic_data.csv")
data.head(3)
Age | Gender | AppointmentRegistration | ApointmentData | DayOfTheWeek | Status | Diabetes | Alcoolism | HiperTension | Handcap | Smokes | Scholarship | Tuberculosis | Sms_Reminder | AwaitingTime | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19 | M | 2014-12-16T14:46:25Z | 2015-01-14T00:00:00Z | Wednesday | Show-Up | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -29 |
1 | 24 | F | 2015-08-18T07:01:26Z | 2015-08-19T00:00:00Z | Wednesday | Show-Up | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |
2 | 4 | F | 2014-02-17T12:53:46Z | 2014-02-18T00:00:00Z | Tuesday | Show-Up | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |
Age: Age of patient
Gender: Gender of patient
AppointmentRegistration: Date on which appointment was issued to the patient
ApointmentData: Date for which appointment was issued to the patient
DayOfTheWeek: Day of the week for which appointment was issued
Status: Whether the patient showed up for the appointment or not (dependent variable)
Diabetes: Whether the patient has diabetes or not
Alcoolism: Whether the patient is affected by alcoholism or not
HiperTension: Whether the patient has hypertension or not
Handcap: Whether the patient is handicapped or not
Smokes: Whether the patient smokes or not
Tuberculosis: Whether the patient has tuberculosis or not
Scholarship: Whether the patient has been granted a scholarship by a social welfare organization or not. Poor families may benefit by receiving financial aid.
Sms_Reminder: Whether an SMS reminder for the appointment has been issued to the patient or not
AwaitingTime: AppointmentRegistration – ApointmentData, in days
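The three preview rows are consistent with AwaitingTime being the gap between the two timestamps, rounded down to whole days. A minimal sketch under that assumption (the timestamps are copied from the preview above; `sample`, `registered`, `scheduled`, and `awaiting_days` are illustrative names, not part of the original notebook):

```python
import pandas as pd

# Timestamp pairs copied from the first three rows of the preview
sample = pd.DataFrame({
    "AppointmentRegistration": ["2014-12-16T14:46:25Z", "2015-08-18T07:01:26Z", "2014-02-17T12:53:46Z"],
    "ApointmentData": ["2015-01-14T00:00:00Z", "2015-08-19T00:00:00Z", "2014-02-18T00:00:00Z"],
})

registered = pd.to_datetime(sample["AppointmentRegistration"])
scheduled = pd.to_datetime(sample["ApointmentData"])

# Registration minus appointment date; .dt.days rounds negative gaps down to whole days
awaiting_days = (registered - scheduled).dt.days
print(awaiting_days.tolist())
```

For these rows the result matches the AwaitingTime column shown in the preview (-29, -1, -1), which also explains why every value in the column is negative.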
Discover why losses are mounting even though the number of appointments is going up.
If patients are not turning up for their scheduled appointments, come up with a method to predict whether a patient will show up on the basis of his or her characteristics. Dr. Joyita believed that knowing which patients were likely not to show up would enable the hospital to take countermeasures like the following:
Provide constant appointment reminders and confirmations.
Bring the head count of doctors and hospital staff in line with the demand at hand.
data.describe()
Age | Diabetes | Alcoolism | HiperTension | Handcap | Smokes | Scholarship | Tuberculosis | Sms_Reminder | AwaitingTime | |
---|---|---|---|---|---|---|---|---|---|---|
count | 300000.000000 | 300000.000000 | 300000.000000 | 300000.000000 | 300000.000000 | 300000.000000 | 300000.000000 | 300000.000000 | 300000.000000 | 300000.000000 |
mean | 37.808017 | 0.077967 | 0.025010 | 0.215890 | 0.020523 | 0.052370 | 0.096897 | 0.000450 | 0.574173 | -13.841813 |
std | 22.809014 | 0.268120 | 0.156156 | 0.411439 | 0.155934 | 0.222772 | 0.295818 | 0.021208 | 0.499826 | 15.687697 |
min | -2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -398.000000 |
25% | 19.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -20.000000 |
50% | 38.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | -8.000000 |
75% | 56.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | -4.000000 |
max | 113.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | -1.000000 |
n = data.nunique(axis=0)
n
Age                        109
Gender                       2
AppointmentRegistration 295425
ApointmentData             534
DayOfTheWeek                 7
Status                       2
Diabetes                     2
Alcoolism                    2
HiperTension                 2
Handcap                      5
Smokes                       2
Scholarship                  2
Tuberculosis                 2
Sms_Reminder                 3
AwaitingTime               213
dtype: int64
def features_plots(discrete_vars):
    # Histograms for the two continuous features, bar charts for the discrete ones
    plt.figure(figsize=(20,30))
    for i, cv in enumerate(['Age', 'AwaitingTime']):
        plt.subplot(7, 2, i+1)
        # One bin per distinct value
        plt.hist(data[cv], bins=len(data[cv].unique()))
        plt.title(cv)
        plt.ylabel('Frequency')
    for i, dv in enumerate(discrete_vars):
        # Discrete plots start at the third cell of the 7 x 2 grid
        plt.subplot(7, 2, i+3)
        data[dv].value_counts().plot(kind='bar', title=dv)
        plt.ylabel('Frequency')

discrete_vars = ['Gender', 'DayOfTheWeek', 'Status', 'Diabetes','Alcoolism', 'HiperTension', 'Handcap', 'Smokes',
                 'Scholarship', 'Tuberculosis', 'Sms_Reminder']
features_plots(discrete_vars)
Age: Age ranged from -2 to 113. Ages between 0 and 113 made sense, but what surprised her was how an age could be negative. It seemed to her that these records were outliers.
Handcap: Instead of being Boolean, this feature took values from 0 to 4.
Sms_Reminder: Instead of being Boolean, it took values from 0 to 2. It seemed to her that Sms_Reminder represented the number of reminders sent to each patient.
AwaitingTime: Dr. Joyita was puzzled to see negative values for AwaitingTime. By definition this feature represented the number of days between the date on which the appointment was issued and the date for which it was issued. She believed that positive numbers would have made more sense.
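Range checks like these can also be automated rather than read off the summary table. A minimal sketch on a made-up frame (the `toy` data and the `expected_binary` list are assumptions for illustration, not the clinic data):

```python
import pandas as pd

# Made-up rows that mimic the anomalies described above
toy = pd.DataFrame({
    "Age": [19, 24, -2, 113],
    "Handcap": [0, 1, 4, 0],
    "Sms_Reminder": [0, 1, 2, 1],
    "Diabetes": [0, 0, 1, 0],
})

# Columns that one would expect to contain only 0 and 1
expected_binary = ["Handcap", "Sms_Reminder", "Diabetes"]

# Columns expected to be 0/1 that contain any other value
not_binary = [c for c in expected_binary if not toy[c].isin([0, 1]).all()]
# Number of rows with an impossible (negative) age
negative_age_rows = int((toy["Age"] < 0).sum())

print(not_binary, negative_age_rows)
```

Applied to the real `data` frame, the same two checks would flag Handcap and Sms_Reminder and count the records with a negative age.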
data = data.loc[data['Age'] >= 0,]
data.Handcap.value_counts(normalize=True)
0    0.981343
1    0.016994
2    0.001497
3    0.000130
4    0.000037
Name: Handcap, dtype: float64
data = data.drop("Handcap",axis=1)
data["AwaitingTime"] = abs(data["AwaitingTime"])
data.head(2)
Age | Gender | AppointmentRegistration | ApointmentData | DayOfTheWeek | Status | Diabetes | Alcoolism | HiperTension | Smokes | Scholarship | Tuberculosis | Sms_Reminder | AwaitingTime | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19 | M | 2014-12-16T14:46:25Z | 2015-01-14T00:00:00Z | Wednesday | Show-Up | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 29 |
1 | 24 | F | 2015-08-18T07:01:26Z | 2015-08-19T00:00:00Z | Wednesday | Show-Up | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
data["Status"] = le.fit_transform(data["Status"])
data["Gender"] = le.fit_transform(data["Gender"])
dow_mapping = {'Monday' : 0, 'Tuesday' : 1, 'Wednesday' : 2, 'Thursday' : 3, 'Friday' : 4, 'Saturday' : 5, 'Sunday' : 6}
data['DayOfTheWeek'] = data['DayOfTheWeek'].map(dow_mapping)
discrete_vars = ['Gender', 'DayOfTheWeek', 'Status', 'Diabetes',
'Alcoolism', 'HiperTension', 'Smokes',
'Scholarship', 'Tuberculosis', 'Sms_Reminder']
features_plots(discrete_vars)
Dr. Joyita noticed that AwaitingTime seemed to decay in an exponential fashion. As per her observation, the most common age was 0 (i.e., infants whose age is counted in months). She also pointed out the spikes at the ages of 19, 38, and 57. Another surprising fact was that only about one-third of the patients were male, and that roughly the same proportion of patients didn't show up at the date and time of their appointments. This information gave her a clue as to why the clinic was seeing losses despite an increase in the number of appointments. She also noticed that a little more than half of the patients were sent at least one SMS reminder, which left roughly 43% of appointments with no reminder at all. The absence of appointment reminders, she believed, might be one reason behind patients not showing up.
Once she had understood the features within the dataset and removed the ambiguities through data cleaning, Dr. Joyita was interested in identifying relationships between different features within the dataset. She wanted to perform this multivariate analysis to gain an intuitive understanding of the types of patients who don't show up at their appointment date and time.
Dr. Joyita had a preconceived notion, just like any other person, that people need to see a doctor more often as they grow older. Hence, with the help of Analytics Educator, she created a scatter plot of Age against AwaitingTime.
plt.figure(figsize=(15,5))
sns.scatterplot(data=data,x="Age",y="AwaitingTime",hue="Status")
plt.xlim(0, 120)
plt.ylim(0, 120)
plt.show()
data_Analytics_Educator = data.groupby(['Sms_Reminder', 'Status'])['Sms_Reminder'].count().unstack('Status').fillna(0)
data_Analytics_Educator
Status | 0 | 1 |
---|---|---|
Sms_Reminder | ||
0 | 38915 | 89631 |
1 | 51546 | 119103 |
2 | 268 | 531 |
data_Analytics_Educator[[0, 1]].plot(kind='bar', stacked=True)
plt.title('Frequency of people showing up and not showing up by number of SMS reminders sent')
plt.xlabel('Number of SMS reminders')
plt.ylabel('Frequency')
plt.show()
Dr. Joyita noticed that noticeably more people showed up after one SMS reminder than with no reminder: about ((119103 - 89631) / 89631) * 100 ≈ 32.9% more. This is a comparison of absolute counts, and more patients received one reminder than none, so the group sizes have to be kept in mind when reading it.
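Since the reminder groups differ in size, it can help to compare show-up rates alongside the raw counts. A minimal sketch that rebuilds the table above by hand (the counts are copied from it; `counts` and `show_up_rate` are illustrative names):

```python
import pandas as pd

# Counts copied from the Sms_Reminder x Status table above (0 = no-show, 1 = show-up)
counts = pd.DataFrame(
    {0: [38915, 51546, 268], 1: [89631, 119103, 531]},
    index=pd.Index([0, 1, 2], name="Sms_Reminder"),
)

# Share of each reminder group that showed up
show_up_rate = counts[1] / counts.sum(axis=1)
print(show_up_rate.round(4))
```

On these counts the show-up rate is about 69.7% with no reminder, 69.8% with one reminder, and 66.5% with two, so the larger absolute number of show-ups after one reminder mostly reflects the larger number of patients who received one.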
data_AE = data.groupby(['DayOfTheWeek', 'Status'])['DayOfTheWeek'].count().unstack('Status').fillna(0)
data_AE.plot(kind='bar', stacked=True)
plt.title('Frequency of people showing up and not showing up by Day of the week')
plt.xlabel('Day of the week')
plt.ylabel('Frequency')
plt.show()
She noticed that hardly any patients came on Saturday (day 5) and that the clinic was closed on Sunday. Otherwise, the number of no-shows was fairly consistent throughout the week, while the distribution of show-ups across the weekdays looked roughly bell-shaped.
By now Dr. Joyita had gained a lot of insight into the data and understood that the clinic was losing money mainly because of patient no-shows. The clinic has to pay for the doctors' time, but there are not enough patients for the doctors. Although the patients turn up at a later time, at the moment the doctors' capacity is underutilized.
Her expectation from the machine learning algorithm was that, by looking at the data, it could predict whether a patient would show up or not. This is a typical binary classification problem.
In machine learning, the greater the number of observations and features in the dataset, the greater the likelihood that the model will capture the variability within it. Dr. Joyita couldn't increase the number of observations, since she didn't have any more data. So she decided to extract the year and month from the appointment date as new features and then drop the original variable.
from datetime import date, time, datetime
data["app_date"] = data["ApointmentData"].str[:10]
data["app_date"] = pd.to_datetime(data["app_date"], format="%Y-%m-%d")
data["app_year"] = data["app_date"].dt.year
data["app_month"] = data["app_date"].dt.month
# dropping the variables app_date and ApointmentData
data = data.drop(["app_date","ApointmentData"],axis=1)
# She decided to drop AppointmentRegistration as well, since it will be of no other use
data = data.drop(["AppointmentRegistration"],axis=1)
data.head()
Age | Gender | DayOfTheWeek | Status | Diabetes | Alcoolism | HiperTension | Handcap | Smokes | Scholarship | Tuberculosis | Sms_Reminder | AwaitingTime | app_year | app_month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 29 | 2015 | 1 |
1 | 24 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2015 | 8 |
2 | 4 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2014 | 2 |
3 | 5 | 1 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 15 | 2014 | 8 |
4 | 38 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 6 | 2015 | 10 |
data = pd.get_dummies(data=data,columns=['app_year', 'app_month'],drop_first=True)
# There are no missing values
data.isnull().sum()
Age              0
Gender           0
DayOfTheWeek     0
Status           0
Diabetes         0
Alcoolism        0
HiperTension     0
Handcap          0
Smokes           0
Scholarship      0
Tuberculosis     0
Sms_Reminder     0
AwaitingTime     0
app_year_2015    0
app_month_2      0
app_month_3      0
app_month_4      0
app_month_5      0
app_month_6      0
app_month_7      0
app_month_8      0
app_month_9      0
app_month_10     0
app_month_11     0
app_month_12     0
dtype: int64
Classification helps us decide which of the given classes a new observation will fall into. Classification comes under supervised learning, where the model can only be trained once labeled data is provided as input. These labels are usually categorical variables, which can be nominal or Boolean in nature. For example, suppose a man applies for a loan at a bank, and the bank wants to predict whether he will pay back the loan on time or default. The following criteria can be used to evaluate a classification model:
Accuracy: Classifier and predictor accuracy
Speed: Time to train and predict from the model
Robustness: Handling missing values and noise
Scalability: Efficiency in disk-resident databases
Interpretability: Predictions made by the model make intuitive sense
In Python, scikit-learn provides several score, loss, and utility functions for measuring classification performance. Some of these metrics require probability estimates of the positive class, confidence values, or binary decision values. Most of them also accept a sample_weight parameter (i.e., a weighted contribution of each sample to the overall score). One of the most basic of these tools is the confusion matrix.
Confusion matrix counts the true negatives, false positives, false negatives, and true positives.
True negatives is the frequency of instances in which the model correctly predicted 0 as 0.
False negatives is the frequency of instances in which the model predicted 1 as 0.
True positives is the frequency of instances in which the model correctly predicted 1 as 1.
False positives is the frequency of instances in which the model predicted 0 as 1.
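The four counts can be read straight off scikit-learn's `confusion_matrix`. A minimal sketch on hypothetical labels (`y_true` and `y_hat` are made up for illustration and are not the clinic data):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_hat  = [0, 1, 1, 1, 1, 1, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)
```

Here one actual 0 is predicted as 0 (true negative), two actual 0s are predicted as 1 (false positives), one actual 1 is predicted as 0 (false negative), and four actual 1s are predicted as 1 (true positives).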
# We remove the label values from our training data
X = data.drop(['Status'],axis=1)
# We assigned those label values to our Y dataset
y = data['Status']
# Split it to a 70:30 Ratio Train:Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
Decision trees form a tree in a hierarchical fashion, with each node having a decision boundary to proceed downward. The tree stops branching out at the level where no more splits are possible. Interior nodes represent input variables and have edges to each of their children. The children split the values of the input variable: the data is partitioned at each level, with nodes branching out to children. This behavior is known as recursive partitioning. Decision trees are easy to interpret and time efficient, and hence they can work well with large datasets. Decision trees can also handle both numerical and categorical targets, that is, regression in the case of numerical and classification in the case of categorical data. However, the accuracy of a single decision tree is often not as good as that of other machine learning classification algorithms.
Moreover, decision trees fit the training dataset very closely and thus are highly susceptible to overfitting. A decision tree aims to partition the data so that the instances in each partition have similar (homogeneous) values. Entropy is used to measure the homogeneity of a sample: a completely homogeneous sample has an entropy of 0, whereas a sample divided equally between two classes has an entropy of 1. A decision tree is a non-parametric supervised learning method; by non-parametric we mean that it makes no assumption about the underlying distribution of the data.
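The two extreme cases of entropy can be checked with a few lines of NumPy. This is a minimal sketch of Shannon entropy in bits; the helper name `entropy` and the toy label lists are assumptions for illustration:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Sum of p * log2(1/p) over the classes that are present
    return float(np.sum(p * np.log2(1.0 / p)))

pure_node = entropy([1, 1, 1, 1])    # completely homogeneous node
mixed_node = entropy([0, 0, 1, 1])   # node split evenly between two classes
print(pure_node, mixed_node)
```

A tree-growing algorithm prefers splits whose child nodes have lower entropy than the parent, which is why a pure node (entropy 0) is never split further.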
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
DecisionTreeClassifier()
Analytics Educator pointed out to Dr. Joyita that because no configuration parameters were passed to the decision tree classifier, it took the default values of configuration parameters. The next step was to apply the trained model on a testing dataset to find the predicted labels of Status. She then aimed to compare the predicted labels to the original label of Status to calculate accuracy of the model.
#predict the test data
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
Class | precision | recall | f1-score | support |
---|---|---|---|---|
0 | 0.33 | 0.36 | 0.34 | 27214 |
1 | 0.71 | 0.68 | 0.69 | 62785 |
accuracy | | | 0.58 | 89999 |
macro avg | 0.52 | 0.52 | 0.52 | 89999 |
weighted avg | 0.59 | 0.58 | 0.59 | 89999 |
Analytics Educator explained to Dr. Joyita that here 0 means the people who didn't turn up for the appointment (no-show). The result above (the classification report) shows that the precision for 0 is 33%, meaning that out of all the 0s predicted by the model, only 33% were correct. Recall, on the other hand, states that out of all the 0s in the data, only 36% were identified correctly.
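Precision and recall for the no-show class can also be computed individually. A minimal sketch on hypothetical labels (`y_true` and `y_hat` are made up for illustration; `pos_label=0` tells scikit-learn to score class 0):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels, for illustration only (0 = no-show, 1 = show-up)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat  = [0, 0, 1, 1, 0, 1, 1, 1, 1, 1]

# pos_label=0 makes the no-show class the one being scored
precision_no_show = precision_score(y_true, y_hat, pos_label=0)
recall_no_show = recall_score(y_true, y_hat, pos_label=0)
print(precision_no_show, recall_no_show)
```

In this toy case the model predicts 0 three times and is right twice (precision 2/3), while it finds two of the four actual 0s (recall 1/2).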
An ensemble method combines predictions from multiple machine learning models, which usually results in more accurate predictions than any individual model could achieve. Ensemble methods are usually divided into two variants: Bagging and Boosting.
Bagging, short for bootstrap aggregating, aims to reduce variance. It does so by drawing repeated samples with replacement from the training dataset to produce multisets of the same size as the original data. Bagging is well suited to models that overfit, that is, models with high variance: many resamples are taken, a model is fitted to each (each one overfitting in its own way), and their predictions are averaged, which cancels some of the variance. Decision trees are sensitive to the specific data on which they are trained. If the training data is changed, the resulting decision tree can be quite different and can yield different predictions. Being a high-variance machine learning algorithm, a decision tree is therefore a natural candidate for Bagging. Consider a dataset that has 50 features and 3,000 observations. Bagging might create 500 trees, each built on 500 random observations and 20 of the features. Finally, it averages the predictions of all 500 trees to obtain the final prediction.
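As a rough illustration of Bagging, scikit-learn's `BaggingClassifier` can wrap a decision tree. This is a minimal sketch on synthetic data from `make_classification`, not the clinic dataset; the sample sizes, the number of trees, and all variable names are arbitrary choices for the example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real dataset
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# 50 trees, each fitted on a bootstrap sample (drawn with replacement) of the training rows
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, bootstrap=True, random_state=0)
bag.fit(Xtr, ytr)

# The ensemble's prediction aggregates the predictions of the individual trees
print(len(bag.estimators_), bag.score(Xte, yte))
```

The `max_samples` and `max_features` parameters of `BaggingClassifier` control how many observations and features each tree sees, which corresponds to the "500 observations, 20 features" idea in the example above.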
Boosting defines an objective function to measure the performance of a model given a certain set of parameters. The objective function contains two parts that are added together: training loss and regularization. The training loss measures how well the model fits the training data; commonly used training losses include mean squared error and logistic loss. The regularization term controls the complexity of the model, which helps avoid overfitting. Boosted trees are tree ensembles: they sum together the predictions of multiple trees.
Random forest classification is a type of Bagging, and it is one of the most widely used machine learning algorithms. When decision trees are bagged, the individual trees can have a lot of structural similarities, which can result in predictions that are strongly correlated with each other. The random forest classifier reduces this correlation among the trees by limiting the features considered at each split point. So, instead of choosing a variable from all the variables available, random forest searches for the variable that will minimize the error within a limited random sample of features. Random forest classifiers are fast and can work with data which is unbalanced or has missing values.
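The feature limit at each split is exposed in scikit-learn through the `max_features` parameter. A minimal sketch on synthetic data (the 16-feature dataset and the variable names are assumptions chosen so that the square root is a round number):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 16 features, standing in for a real dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=16, random_state=0)

# max_features="sqrt": each split considers a random subset of sqrt(16) = 4 features
forest = RandomForestClassifier(n_estimators=25, max_features="sqrt", random_state=0)
forest.fit(X_demo, y_demo)

# Every fitted tree records how many features it may inspect per split
features_per_split = forest.estimators_[0].max_features_
print(features_per_split)
```

Lower values of `max_features` decorrelate the trees more strongly, usually at the cost of weaker individual trees.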
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
RandomForestClassifier()
y_pred = rf.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
Class | precision | recall | f1-score | support |
---|---|---|---|---|
0 | 0.35 | 0.23 | 0.28 | 27214 |
1 | 0.71 | 0.82 | 0.76 | 62785 |
accuracy | | | 0.64 | 89999 |
macro avg | 0.53 | 0.52 | 0.52 | 89999 |
weighted avg | 0.60 | 0.64 | 0.61 | 89999 |
In Boosting, the selection of samples is done by giving more and more weight to hard-to-classify observations. Gradient boosting classification produces a prediction model in the form of an ensemble of weak predictive models, usually decision trees. It generalizes boosting by allowing the optimization of an arbitrary differentiable loss function. At each stage, regression trees are fitted on the negative gradient of the binomial or multinomial deviance loss function.
In simple terminology, the gradient boosting classifier does the following:
Gradient boosting builds an ensemble of trees one by one.
Predictions of all individual trees are summed.
Discrepancy between target function and current ensemble prediction (i.e., residual) is reconstructed.
The next tree in the ensemble should complement existing trees and minimize the residual of the ensemble.
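The steps above can be sketched by hand for the simplest case. The example below uses squared-error loss on a made-up regression problem, where the negative gradient is just the residual; it is an illustration of the idea, not what `GradientBoostingClassifier` does internally (that uses a deviance loss), and all names and settings are assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up regression data: a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant model
errors = []

for _ in range(50):
    residual = y_demo - prediction  # negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residual)
    prediction += learning_rate * tree.predict(X_demo)  # add the new tree to the ensemble
    errors.append(float(np.mean((y_demo - prediction) ** 2)))

# Training error after the first and after the last tree
print(errors[0], errors[-1])
```

Each new tree is fitted to what the current ensemble still gets wrong, so the training error shrinks as trees are added; the learning rate limits how much any single tree contributes.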
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
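# Caution: the arguments here are (y_pred, y_test), the reverse of the (y_test, y_pred) order
# used for the earlier models, so the precision and recall columns in this report are
# interchanged and the support column counts predictions rather than actual labels.
print(classification_report(y_pred,y_test))
Class | precision | recall | f1-score | support |
---|---|---|---|---|
0 | 0.01 | 0.52 | 0.03 | 780 |
1 | 0.99 | 0.70 | 0.82 | 89219 |
accuracy | | | 0.70 | 89999 |
macro avg | 0.50 | 0.61 | 0.43 | 89999 |
weighted avg | 0.99 | 0.70 | 0.81 | 89999 |
Because y_pred and y_test were passed in reverse order here, this report has to be read with precision and recall interchanged. The support column shows that the model predicted a no-show (0) for only 780 of the 89,999 test appointments; about 52% of those predictions were correct (precision), but they cover only about 1% of the actual no-shows (recall). Overall accuracy rose to 0.70, largely because the model predicts a show-up almost every time. Dr. Joyita thought of applying a deep neural network to check whether it would improve the results further, but since she had a board meeting scheduled in the next half an hour, she decided to leave with these findings. Readers of this blog are welcome to mail their suggestions on how to improve the model further; you will get our contact details here.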
If you want to read more such case studies, then click on Whom should you ask for donations for a charity or Identify if a patient has cancer.
Regression problems can be found at House Price Prediction and Insurance Premium Prediction